Zurich - 27 & 28 June 2022
Methods for dealing with missing data implicitly assume a missing data mechanism.
MCAR: the strictest assumption. In practice, MCAR data are also the easiest to deal with.
MAR: a less strict assumption. Most advanced missing data methods (e.g. multiple imputation, FIML) assume this mechanism.
MNAR: the least strict assumption.
Complete-case analysis (CCA): only the cases with observed data for all variables involved are used in the analysis.
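The idea can be sketched in a few lines (a Python illustration with made-up data, not from the course materials):

```python
# Complete-case analysis: keep only the rows that are observed on
# every variable involved in the analysis. Data are hypothetical.
rows = [
    {"x": 1.0, "y": 2.0},
    {"x": 3.0, "y": None},   # y missing: dropped under CCA
    {"x": 4.0, "y": 5.0},
    {"x": None, "y": 1.0},   # x missing: dropped under CCA
]

complete = [r for r in rows if all(v is not None for v in r.values())]
print(len(complete))  # 2 of the 4 cases remain
```

Note that CCA discards partially observed cases entirely, which is why it loses power and can bias estimates when the data are not MCAR.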
Parameter estimates:
| | estimate | std.error | p.value |
|---|---|---|---|
| Complete | 0.22 | 0.09 | 0.02 |
| MCAR | 0.24 | 0.17 | 0.16 |
| | estimate | std.error | p.value |
|---|---|---|---|
| Complete | 0.22 | 0.09 | 0.02 |
| MAR | 0.09 | 0.15 | 0.53 |
Imputation: replacing a missing value with a plausible value.
Possible values to impute?
Other examples:
| Gender | n | n (weight) | mean | sd | n (performance) | mean | sd |
|---|---|---|---|---|---|---|---|
| Man | 81 | 73 | 86.009 | 9.660 | 73 | 7.998 | 3.539 |
| Woman | 61 | 50 | 75.782 | 8.478 | 52 | 9.782 | 3.238 |
Parameter estimates:
Conditional mean imputation: estimate the imputed value as a predicted value from a regression model: \[Y_{imp} = \beta_0 + \beta_1 * X\]
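A minimal sketch of conditional mean imputation (Python, with hypothetical numbers): fit the regression on the observed pairs, then fill each missing \(Y\) with its predicted value.

```python
# Conditional (regression) mean imputation: Y_imp = b0 + b1 * X,
# with b0 and b1 estimated by least squares on the observed pairs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, None, 8.1]   # Y is missing for x = 3 (hypothetical data)

obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
n = len(obs)
mx = sum(x for x, _ in obs) / n
my = sum(y for _, y in obs) / n
b1 = sum((x - mx) * (y - my) for x, y in obs) / sum((x - mx) ** 2 for x, _ in obs)
b0 = my - b1 * mx

# Observed values are kept; only the missing entry gets the prediction.
y_imp = [b0 + b1 * x if y is None else y for x, y in zip(xs, ys)]
```

Every imputed value lies exactly on the regression line, which artificially strengthens the X–Y association and understates variability.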
Parameter estimates:
Imputation from regression equation: \(Performance = \beta_0 + \beta_1 * weight + \beta_2 * gender\)
Stochastic regression imputation: regression imputation with additional sampling error added to the predicted value:
\[Y_{imp} = \beta_0 + \beta_1 * X + \epsilon\]
Sampling error is normally distributed.
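The extra step can be sketched as follows (Python; the coefficients and residual standard deviation are assumed to come from a previously fitted model):

```python
import random

# Stochastic regression imputation: as regression imputation, but a
# normally distributed error term is added to each predicted value.
b0, b1, resid_sd = 0.0, 2.0, 0.5   # hypothetical fitted values
x_missing = [3.0, 5.0]             # X values of the cases with missing Y

rng = random.Random(42)            # fixed seed for reproducibility
y_imp = [b0 + b1 * x + rng.gauss(0.0, resid_sd) for x in x_missing]
```

Unlike conditional mean imputation, the imputed values scatter around the regression line instead of lying on it.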
Parameter estimates:
Imputation uncertainty is not taken into account
Imputation from regression equation: \(Performance = \beta_0 + \beta_1 * weight + \beta_2 * gender + \epsilon\)
Last observation carried forward (LOCF), an ad hoc method for longitudinal data: use the last observed value to impute the missing values.
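LOCF is mechanically simple, as this Python sketch of one subject's measurements over time shows (None marks a missing visit; hypothetical data):

```python
def locf(values):
    """Replace each missing value with the most recent observed one."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

print(locf([4.0, None, 5.0, None, None]))  # [4.0, 4.0, 5.0, 5.0, 5.0]
```

A value missing before any observation stays missing, and every gap becomes a flat line, which distorts trajectories that change over time.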
Parameter estimates:
>    id group 0 3 6 12 24
> 76  1     1 1 1 1  1  1  0
> 16  1     1 1 1 1  1  0  1
>  9  1     1 1 1 1  0  0  2
>  7  1     1 1 1 0  0  0  3
>  2  1     1 1 0 0  0  0  4
>     0     0 0 2 9 18 34 63
The trajectories over time, with LOCF imputations in red.
Below only the cases with missing observations, separated by group.
Fully Conditional Specification (FCS): imputes variable-by-variable.
Joint modelling: imputes from a multivariate distribution.
mice uses FCS
Per variable with missing data:
More on the methods later.
Imputation model = method + predictors
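The FCS cycle can be sketched as follows (Python; the per-variable "model" here is a simple least-squares line to keep the sketch short, whereas mice uses proper imputation methods such as pmm):

```python
def fit_line(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1

def mean_fill(col):
    """Initial fill: replace missing entries with the observed mean."""
    obs = [v for v in col if v is not None]
    m = sum(obs) / len(obs)
    return [m if v is None else v for v in col]

# Two variables with one missing value each (hypothetical data).
a = [1.0, None, 3.0]
b = [None, 2.0, 4.0]
miss_a = [i for i, v in enumerate(a) if v is None]
miss_b = [i for i, v in enumerate(b) if v is None]

# Start from mean imputation, then cycle variable-by-variable: each
# sweep re-imputes a from the current b, then b from the current a.
a, b = mean_fill(a), mean_fill(b)
for _ in range(10):
    b0, b1 = fit_line(b, a)
    for i in miss_a:
        a[i] = b0 + b1 * b[i]
    b0, b1 = fit_line(a, b)
    for i in miss_b:
        b[i] = b0 + b1 * a[i]
```

Each incomplete variable is imputed conditional on the current values of all the others, and the sweeps are repeated until the imputations stabilize, which is exactly the role of the iterations in mice.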
Auxiliary variables: variables related to the probability of missing data or to the variable with missing data.
Iterations for generating the imputations for Solar.R in the first data set:
Iterations for generating the imputations for Solar.R in the second data set:
Iterations for generating the imputations for Solar.R in all five data sets:
Each imputed data set is analyzed with the substantive analysis model.
This results in \(m\) sets of results.
Workflow:
> # A tibble: 10 x 6
>    term        estimate std.error statistic   p.value  nobs
>    <chr>          <dbl>     <dbl>     <dbl>     <dbl> <int>
>  1 (Intercept)  24.8       5.71        4.34 0.0000258   153
>  2 Solar.R       0.0996    0.0278      3.58 0.000455    153
>  3 (Intercept)  21.7       5.62        3.85 0.000173    153
>  4 Solar.R       0.105     0.0271      3.88 0.000153    153
>  5 (Intercept)  21.3       5.53        3.86 0.000165    153
>  6 Solar.R       0.103     0.0267      3.87 0.000164    153
>  7 (Intercept)  21.8       5.55        3.93 0.000127    153
>  8 Solar.R       0.109     0.0271      4.02 0.0000930   153
>  9 (Intercept)  20.0       5.55        3.60 0.000433    153
> 10 Solar.R       0.112     0.0271      4.14 0.0000578   153
> call :
> with.mids(data = imp_pmm, expr = psych::describe(Ozone))
>
> call1 :
> mice(data = airquality, method = "pmm", printFlag = F)
>
> nmis :
>   Ozone Solar.R    Wind    Temp   Month     Day
>      37       7       0       0       0       0
>
> analyses :
> [[1]]
>    vars   n  mean sd median trimmed  mad min max range skew kurtosis   se
> X1    1 153 43.24 32     35   39.19 25.2   1 168   167 1.18     1.11 2.59
>
> [[2]]
>    vars   n  mean    sd median trimmed   mad min max range skew kurtosis   se
> X1    1 153 41.41 31.09     32   37.23 23.72   1 168   167 1.24     1.32 2.51
>
> [[3]]
>    vars   n mean    sd median trimmed   mad min max range skew kurtosis  se
> X1    1 153 40.6 30.91     31   36.21 22.24   1 168   167 1.32     1.56 2.5
>
> [[4]]
>    vars   n  mean    sd median trimmed   mad min max range skew kurtosis   se
> X1    1 153 41.84 31.63     34    37.8 26.69   1 168   167 1.22     1.35 2.56
>
> [[5]]
>    vars   n  mean    sd median trimmed   mad min max range skew kurtosis   se
> X1    1 153 40.62 31.85     31   36.29 23.72   1 168   167 1.3      1.44 2.57
Pooled estimate:
\[\hat\theta = \frac{1}{m} \sum^m_{i=1}{\theta_i}\]
Between variance:
\[\sigma^2_{between} = \frac{\sum^m_{i=1}(\theta_i - \overline\theta)^2}{m-1}\]
Within variance:
\[\sigma^2_{within} = \frac{\sum^m_{i=1}\sigma^2_i}{m}\]
Total variance:
\[\sigma^2_{total} = \sigma^2_{within} + \sigma^2_{between} + \frac{\sigma^2_{between}}{m}\]
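Numerically, pooling with these formulas works like this (Python sketch; the estimates and squared standard errors are hypothetical numbers, not results from the course data):

```python
# Rubin's rules for pooling m = 5 sets of results.
m = 5
estimates = [0.22, 0.25, 0.21, 0.24, 0.23]        # per-data-set estimates
variances = [0.010, 0.012, 0.011, 0.009, 0.013]   # squared std. errors

pooled = sum(estimates) / m
within = sum(variances) / m                                    # sigma2_within
between = sum((e - pooled) ** 2 for e in estimates) / (m - 1)  # sigma2_between
total = within + between + between / m                         # sigma2_total
```

The total variance is larger than the average within-imputation variance: the between-imputation spread is the penalty for not knowing the missing values.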
> Class: mipo    m = 5
>          term m   estimate         ubar            b            t dfcom
> 1 (Intercept) 5 21.9250591 3.128523e+01 3.114620e+00 3.502277e+01   151
> 2     Solar.R 5  0.1057732 7.371338e-04 2.371589e-05 7.655929e-04   151
>          df        riv     lambda       fmi
> 1  96.54006 0.11946675 0.10671755 0.1246657
> 2 136.72129 0.03860773 0.03717258 0.0509547
In the imputation phase, imputed values are estimated using an imputation method.
Methods: norm.predict, norm.nob, norm, pmm.

| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ozone | 1 | 116 | 42.129310 | 32.987884 | 31.5 | 37.797872 | 25.94550 | 1.0 | 168.0 | 167 | 1.2098656 | 1.1122431 | 3.0628482 |
| Solar.R | 2 | 146 | 185.931507 | 90.058422 | 205.0 | 190.338983 | 98.59290 | 7.0 | 334.0 | 327 | -0.4192893 | -1.0040581 | 7.4532881 |
| Wind | 3 | 153 | 9.957516 | 3.523001 | 9.7 | 9.869919 | 3.40998 | 1.7 | 20.7 | 19 | 0.3410275 | 0.0288647 | 0.2848178 |
| Temp | 4 | 153 | 77.882353 | 9.465270 | 79.0 | 78.284553 | 8.89560 | 56.0 | 97.0 | 41 | -0.3705073 | -0.4628929 | 0.7652217 |
| Month | 5 | 153 | 6.993464 | 1.416522 | 7.0 | 6.991870 | 1.48260 | 5.0 | 9.0 | 4 | -0.0023448 | -1.3167465 | 0.1145191 |
| Day | 6 | 153 | 15.803922 | 8.864520 | 16.0 | 15.804878 | 11.86080 | 1.0 | 31.0 | 30 | 0.0026001 | -1.2224406 | 0.7166540 |
Imputed value: \(Y_{imp} = \hat{\beta}_0 + X_{mis}\hat{\beta}_1 + \epsilon\)
Parameters \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are estimated from the observed data.
And updated after the first imputation with the imputed data.
\(\epsilon\) is normally distributed residual error
Imputed value: \(Y_{imp} = \dot{\beta}_0 + X_{mis}\dot{\beta}_1 + \epsilon\)
Parameters \(\dot{\beta}_0\) and \(\dot{\beta}_1\) are drawn from their posterior distribution.
\(\epsilon\) is normally distributed residual error
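pmm (predictive mean matching), mice's default method for continuous variables, can be sketched as follows (Python; the predicted values are hypothetical numbers, and mice additionally draws the regression parameters rather than fixing them):

```python
import random

# PMM: for each missing case, find the d observed cases whose predicted
# values are closest to the missing case's prediction, then impute a
# randomly drawn donor's *observed* value.
pred_obs = [10.0, 12.0, 15.0, 19.0]   # predictions for observed cases
y_obs = [11.0, 13.0, 14.0, 20.0]      # their observed values
pred_mis = 14.2                        # prediction for one missing case
d = 3                                  # donor pool size

rng = random.Random(1)
donors = sorted(range(len(pred_obs)),
                key=lambda i: abs(pred_obs[i] - pred_mis))[:d]
y_imp = y_obs[rng.choice(donors)]
assert y_imp in y_obs  # imputed value is always an observed value
```

Because imputations are drawn from observed values, pmm never produces impossible values (e.g. negative counts), unlike the norm-type methods.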
Methods: logreg, polyreg, cart.

Imputed value: \(\log\frac{P(Y_{mis})}{1-P(Y_{mis})} = \dot{\beta}_0 + X_{mis}\dot{\beta}_1 + \epsilon\)
Parameters \(\dot{\beta}_0\) and \(\dot{\beta}_1\) are drawn from their posterior distribution.
\(\epsilon\) is normally distributed residual error
Fit a multinomial regression model.
Parameters are drawn from their posterior distribution (Bayesian).
Compute the predicted category.
Add normally distributed residual error to account for sampling variance.
For continuous or categorical variables
library(mice)
help(mice)
Most important arguments in mice()
m: number of multiple imputations
method: imputation method
predictorMatrix: matrix where a value of 1 means that the column variable is used as a predictor for the target variable in the row
maxit: number of iterations
seed: offsetting the random number generator

summary(nhanes2)
>     age          bmi            hyp          chl
>  20-39:12   Min.   :20.40   no  :13   Min.   :113.0
>  40-59: 7   1st Qu.:22.65   yes : 4   1st Qu.:185.0
>  60-99: 6   Median :26.75   NA's: 8   Median :187.0
>             Mean   :26.56             Mean   :191.4
>             3rd Qu.:28.93             3rd Qu.:212.0
>             Max.   :35.30             Max.   :284.0
>             NA's   : 9                NA's   :10
imp <- mice(nhanes2, m = 5, maxit = 10, seed = 800)
>
>  iter imp variable
>   1   1  bmi  hyp  chl
>   1   2  bmi  hyp  chl
>   1   3  bmi  hyp  chl
>   1   4  bmi  hyp  chl
>   1   5  bmi  hyp  chl
>   2   1  bmi  hyp  chl
>   2   2  bmi  hyp  chl
>   2   3  bmi  hyp  chl
>   2   4  bmi  hyp  chl
>   2   5  bmi  hyp  chl
>   3   1  bmi  hyp  chl
>   3   2  bmi  hyp  chl
>   3   3  bmi  hyp  chl
>   3   4  bmi  hyp  chl
>   3   5  bmi  hyp  chl
>   4   1  bmi  hyp  chl
>   4   2  bmi  hyp  chl
>   4   3  bmi  hyp  chl
>   4   4  bmi  hyp  chl
>   4   5  bmi  hyp  chl
>   5   1  bmi  hyp  chl
>   5   2  bmi  hyp  chl
>   5   3  bmi  hyp  chl
>   5   4  bmi  hyp  chl
>   5   5  bmi  hyp  chl
>   6   1  bmi  hyp  chl
>   6   2  bmi  hyp  chl
>   6   3  bmi  hyp  chl
>   6   4  bmi  hyp  chl
>   6   5  bmi  hyp  chl
>   7   1  bmi  hyp  chl
>   7   2  bmi  hyp  chl
>   7   3  bmi  hyp  chl
>   7   4  bmi  hyp  chl
>   7   5  bmi  hyp  chl
>   8   1  bmi  hyp  chl
>   8   2  bmi  hyp  chl
>   8   3  bmi  hyp  chl
>   8   4  bmi  hyp  chl
>   8   5  bmi  hyp  chl
>   9   1  bmi  hyp  chl
>   9   2  bmi  hyp  chl
>   9   3  bmi  hyp  chl
>   9   4  bmi  hyp  chl
>   9   5  bmi  hyp  chl
>  10   1  bmi  hyp  chl
>  10   2  bmi  hyp  chl
>  10   3  bmi  hyp  chl
>  10   4  bmi  hyp  chl
>  10   5  bmi  hyp  chl
imp$pred
>     age bmi hyp chl
> age   0   1   1   1
> bmi   1   0   1   1
> hyp   1   1   0   1
> chl   1   1   1   0
Remove a predictor variable, e.g. remove “bmi” as a predictor for “chl”.
imp$pred["chl","bmi"] <- 0
imp$pred
>     age bmi hyp chl
> age   0   1   1   1
> bmi   1   0   1   1
> hyp   1   1   0   1
> chl   1   0   1   0
pred_update <- imp$pred
imp$meth
>      age      bmi      hyp      chl
>       ""    "pmm" "logreg"    "pmm"
Change the imputation method for one variable
imp$meth["bmi"] <- "norm"
imp$meth
>      age      bmi      hyp      chl
>       ""   "norm" "logreg"    "pmm"
meth_update <- imp$meth
imp_update <- mice(nhanes2, m = 5, maxit = 10, pred = pred_update, meth = meth_update, seed = 90)
>
>  iter imp variable
>   1   1  bmi  hyp  chl
>   1   2  bmi  hyp  chl
>   1   3  bmi  hyp  chl
>   1   4  bmi  hyp  chl
>   1   5  bmi  hyp  chl
>   2   1  bmi  hyp  chl
>   2   2  bmi  hyp  chl
>   2   3  bmi  hyp  chl
>   2   4  bmi  hyp  chl
>   2   5  bmi  hyp  chl
>   3   1  bmi  hyp  chl
>   3   2  bmi  hyp  chl
>   3   3  bmi  hyp  chl
>   3   4  bmi  hyp  chl
>   3   5  bmi  hyp  chl
>   4   1  bmi  hyp  chl
>   4   2  bmi  hyp  chl
>   4   3  bmi  hyp  chl
>   4   4  bmi  hyp  chl
>   4   5  bmi  hyp  chl
>   5   1  bmi  hyp  chl
>   5   2  bmi  hyp  chl
>   5   3  bmi  hyp  chl
>   5   4  bmi  hyp  chl
>   5   5  bmi  hyp  chl
>   6   1  bmi  hyp  chl
>   6   2  bmi  hyp  chl
>   6   3  bmi  hyp  chl
>   6   4  bmi  hyp  chl
>   6   5  bmi  hyp  chl
>   7   1  bmi  hyp  chl
>   7   2  bmi  hyp  chl
>   7   3  bmi  hyp  chl
>   7   4  bmi  hyp  chl
>   7   5  bmi  hyp  chl
>   8   1  bmi  hyp  chl
>   8   2  bmi  hyp  chl
>   8   3  bmi  hyp  chl
>   8   4  bmi  hyp  chl
>   8   5  bmi  hyp  chl
>   9   1  bmi  hyp  chl
>   9   2  bmi  hyp  chl
>   9   3  bmi  hyp  chl
>   9   4  bmi  hyp  chl
>   9   5  bmi  hyp  chl
>  10   1  bmi  hyp  chl
>  10   2  bmi  hyp  chl
>  10   3  bmi  hyp  chl
>  10   4  bmi  hyp  chl
>  10   5  bmi  hyp  chl
plot(imp)
stripplot(imp)
xyplot(imp, chl ~ bmi)
fit <- with(imp, lm(chl ~ age))
summary(fit)
> # A tibble: 15 x 6
>    term        estimate std.error statistic  p.value  nobs
>    <chr>          <dbl>     <dbl>     <dbl>    <dbl> <int>
>  1 (Intercept)    182.      10.9     16.6   6.09e-14    25
>  2 age40-59        19.1      18.0     1.06  2.99e- 1    25
>  3 age60-99        39.4      18.9     2.08  4.90e- 2    25
>  4 (Intercept)    177.      11.0     16.1   1.14e-13    25
>  5 age40-59        21.3      18.0     1.18  2.51e- 1    25
>  6 age60-99        41.4      19.0     2.18  4.00e- 2    25
>  7 (Intercept)    184.      11.5     16.1   1.21e-13    25
>  8 age40-59        25.8      18.9     1.37  1.85e- 1    25
>  9 age60-99        56.8      19.8     2.86  9.02e- 3    25
> 10 (Intercept)    178.      11.0     16.1   1.11e-13    25
> 11 age40-59        26.2      18.1     1.44  1.63e- 1    25
> 12 age60-99        26.7      19.1     1.40  1.76e- 1    25
> 13 (Intercept)    183.      13.0     14.1   1.70e-12    25
> 14 age40-59        17.5      21.4     0.817 4.22e- 1    25
> 15 age60-99        54.6      22.5     2.43  2.39e- 2    25
pool(fit)
> Class: mipo    m = 5
>          term m  estimate     ubar         b        t dfcom       df        riv
> 1 (Intercept) 5 180.61667 132.0368  11.14792 145.4143    22 17.69014 0.10131643
> 2    age40-59 5  21.98333 358.3857  15.34230 376.7964    22 19.03236 0.05137138
> 3    age60-99 5  43.78333 396.1105 151.14792 577.4880    22 10.34211 0.45789625
>       lambda       fmi
> 1 0.09199575 0.1797674
> 2 0.04886131 0.1352014
> 3 0.31408014 0.4169004